Comparative evaluation of term selection functions for authorship attribution
نویسنده
چکیده
Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, part-of-speech, bigram of words, etc.) to discriminate between different authors. To achieve this, we can consider different independent feature-scoring selection functions (information gain, gain ratio, pointwise mutual information, odds ratio, chi-square, bi-normal separation, GSS, Darmstadt Indexing Approach (DIA), and correlation coefficient). Other term selection strategies have also been suggested in specific authorship attribution studies. To compare these two families of selection procedures, we have extracted articles from two newspapers and belonging to two categories (sports and politics). To enlarge the basis of our evaluations, we have chosen one newspaper written in the English language (‘Glasgow Herald’) and a second one in Italian (‘La Stampa’). The resulting collections contain from 987 to 2,036 articles written by four to ten columnists. Using the Kullback–Leibler divergence, the chisquare measure and the Delta rule as attribution schemes, this study found that some simple selection strategies (based on occurrence frequency or document frequency) may produce similar, and sometimes better, results compared with more complex ones. .................................................................................................................................................................................
منابع مشابه
Etude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur
The authorship attribution problem can be viewed as a categorization problem. To determine the most effective features to discriminate between different writers (or categories), we have evaluated seven feature selection functions (e.g., pointwise mutual information, information gain, odds ratio, !, or correlation coefficient). We have also considered two selection functions proposed in the cont...
متن کاملApplication of PROMETHEE method for green supplier selection: a comparative result based on preference functions
The PROMETHEE is a significant method for evaluating alternatives with respect to criteria in multi-criteria decision-making problems. It is characterized by many types of preference functions that are used for assigning the differences between alternatives in judgements. This paper proposes a preference of green suppliers using the PROMETHEE under the usual criterion preference functions. Comp...
متن کاملAuthorship Attribution in Bengali Language
We describe Authorship Attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-toend system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and learning curve to assess the relationship between amount of training data and test accu...
متن کاملTowards a better understanding of Burrows's Delta in literary authorship attribution
Burrows’s Delta is the most established measure for stylometric difference in literary authorship attribution. Several improvements on the original Delta have been proposed. However, a recent empirical study showed that none of the proposed variants constitute a major improvement in terms of authorship attribution performance. With this paper, we try to improve our understanding of how and why ...
متن کاملEPSMS and the Document Occurrence Representation for Authorship Identification - Notebook for PAN at CLEF 2011
This paper describes the participation of the PISIS team in the authorship identification track of PAN’11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- DSH
دوره 30 شماره
صفحات -
تاریخ انتشار 2015